Fragment contribution models for predicting skin permeability using HuskinDB

Mathematical models to predict skin permeation tend to be based on animal derived experimental data as well as knowing physicochemical properties of the compound under investigation, such as molecular volume, polarity and lipophilicity. This paper presents a strikingly contrasting model to predict permeability, formed entirely from simple chemical fragment (functional group) data and a recently released, freely accessible human (i.e. non-animal) skin permeation database, known as the ‘Human Skin Database – HuskinDB’. Data from within the database allowed development of several fragment-based models, each including a calculable effect for all of the most commonly encountered functional groups present in compounds within the database. The developed models can be applied to predict human skin permeability (logKp) for any compound containing one or more of the functional groups analysed from the dataset with no need to know any other physicochemical properties, solely the type and number of each functional group within the chemical structure itself. This approach simplifies mathematical prediction of permeability for compounds with similar properties to those used in this study.


Introduction
The rate and extent of permeation through human skin is a fundamental property that must be determined for any compound that may come into contact with skin, including a plethora of chemicals found in cosmetics and pharmaceutical products.In some cases this permeation may be desirable, such as transdermal drug delivery systems 1 , yet in other cases it should be avoided, such as for cosmetics and sun protection products 2 .Experimental determination of permeation through human skin is a complex and expensive process with alternatives frequently used such as animal skin 3 , even though these are known to often be unreliable predictive systems and bring their own issues regarding storage, preparation and predictive ability 4 .Thirty years ago, as an alternative to experimental determination, researchers began to consider mathematical models for predicting skin permeability including the well-known 'Potts and Guy' model 5 .In more recent years in silico-based systems have become more widespread comprising a range of computational approaches [6][7][8] , such as machine learning methods 9 and quantitative structure-permeability relationships (QSPRs) that relate skin permeability to physicochemical properties and structural descriptors 10,11 .The vast majority of these models have relied on properties for a compound that are either well-known, for example molecular weight 12 , along with properties that can be predicted, for example lipophilicity and polar surface area 13 .Therefore, for a potentially permeable compound such models require a range of information to be known which may not be available.Very few studies have thought to try and simplify the structure-permeability relationship by quantifying group contributions from functional groups within the compound as a correlatable property.One study partially analysed functional group effects on the permeability of hydrocortisone esters from the perspective of their free energy of transfer of solute into the rate-limiting barrier of the stratum corneum but did not expand the relationship beyond this group of compounds 14 .A second study by the same authors analysed the functional group effects on permeability of methyl-substituted p-creosols yet again, did not expand the concept beyond this series of compounds 15 .In another study a range of properties were considered for predicting permeability including the number of carbon atoms present and some functional groups, although the latter were then mainly excluded as not significant within the dataset used, possibly as the dataset was comparatively small (n = 91) and based on animal (as opposed to human) skin data 16 .
In other analytical in silico scenarios fragment-based approaches have been used to predict useful information, for example for molecular property prediction 17 , and more relevant to this study, to predict permeation through the blood-brain barrier (BBB) 18 whereby the models developed were noted as useful for the identification, selection and design of new drug candidates.A variety of additional uses of the group contribution approach can also be found in the literature, for example to predict the properties of aerosols 19 and the well-established estimation of partition coefficients 20,21 .To be able to create such a model requires a comprehensive dataset of experimental data which may include information on experimental uncertainty with chemical descriptors 22 .For this study a freely accessible and comprehensive dataset of skin permeation data was used that solely comprised of human skin data with incorporated experimental parameters for each value included, known as HuskinDB 23 , making it the most relevant dataset available for consideration.Previous work from our group has established several models for predicting skin permeability using this dataset 24 yet the models required knowledge of the physicochemical properties of the compound under investigation, such as partition coefficient (logP), topological surface area (TPSA) and molecular volume (MV).This study negates the need for such information by correlating skin permeation with only the knowledge of the type and number of functional groups present within the molecule in question, thus simplifying the predictive process immensely.

Results
Firstly, a QSPR model was initially created using permeability coefficient (logK p ) values from the HuskinDB dataset for 180 compounds containing ten functional groups (as listed in the Methods section) to confirm the validity of the concept.This dataset was selected from the original full dataset to avoid seven 'unusual' functional groups (boron, cyanide, epoxide, fluorine, nitro, phosphate and thiol) where only a small number of compounds included these groups and it was deemed insufficient data to create a reliable contribution for those particular groups.Where more than one logK p value was available for a compound a series of experimental parameters were chosen to reduce the value to one, selected to most closely reflect those experienced in vivo, namely: abdomen source, epidermis and dermis layers, concentrated solute and an experimental donor solution temperature 31-35 °C.An experimental donor pH between 7 and 7.5 was selected to maximise the dataset as the majority of compounds that had specified pH only had data in this pH range available.The ten most commonly-encountered functional groups within the dataset (amide, amine, aromatic, bromine, carboxylic acid, chlorine, ester, ether, hydroxyl and ketone) were independently correlated with logK p for each of the training set compounds (n = 144) to produce an equation (Eq. 1) that considers the individual contributions of each fragment to the overall permeation value.Coefficients of determination (R 2 ) and root mean square error (RMSE) values for the training set (n = 144) and subsequent test set (n = 36) are presented along with Eq. 1 which displays the contribution for each functional group analysed.It should be noted that this equation also takes into consideration the prevalence of each functional group present in the molecule, for example if the compound contains two aromatic groups then the contribution value should be calculated as (+0.186 × 2).(1) Although the resultant equation allows prediction for permeation for the first time using group contributions for any compound that contains one or more of the ten functional groups included, and the training and test sets both produced a reasonable correlation it is not as high as some models using physicochemical properties seen by others, such as that of Moss and Cronin's analysis of steroids (n = 116, R 2 = 0.82) 25 or Magnusson et al. with an equation based on molecular weight alone (n = 87, R 2 = 0.847) 26 .
For this reason, a second model was established to understand the relationship between predictive ability and experimental conditions.To achieve this, permeation data was divided into four experimental categories: skin source (breast/abdomen/thigh), skin type (epidermis/epidermis + dermis/dermis/stratum corneum), donor concentration (dilute/saturated) and experimental temperature (20-25/26-30/31-35/36-40 °C).A summary of the resultant equations with functional group contributions is displayed in Table 1.
From all of the potential models created in Table 1, the most suitable for use is that which has a comparatively high number of compounds within the dataset and yet also as high as possible R 2 .Combining these two aspects ensures the chosen model will have both wide applicability for a range of compounds and a good degree of correlation, i.e., good predictive ability.Based on the data in Table 1 the most suitable model to meet these criteria is that based on abdomen skin with only the epidermis permeated and in a diluted donor solution at 31-35 °C.Under these conditions the number of compounds and coefficient of determination are both comparatively reasonable (from within the ranges displayed in Table 1), thus this model was selected as the most suitable.As before, the total number of compounds was separated into a training set (to derive Eq. 2) with associated coefficients of determination and root mean square error (RMSE) values for both the training and subsequent test sets.
(2) This model could be simplified even further by removing the bromine contribution as with a value of zero it is unnecessary for inclusion.Figure 1 displays the relationship between the predicted and experimental logK p values for the 36 compounds analysed using Eq. 2 based upon HuskinDB logarithmic K p values expressed in cm/s.

Discussion
As visualised in Fig. 1, there is a clear correlation between the predicted logK p values and those found experimentally from HuskinDB, confirming the relationship between the functional groups present in a compound and their influence on permeation.Even though the dataset for Eq. 2 was far smaller than that for Eq. 1, Eq. 2 still included a range of compounds that included all of the functional groups under investigation.The R 2 value obtained with the training set, and even more importantly the test set, indicate that this model is far superior to Eq. 1 from the far larger dataset.In comparison with our previously proposed model 24 that utilised physicochemical data for each compound, the model presented in Eq. 2 can be considered superior (despite the smaller dataset) based upon the higher R 2 values (0.7125 and 0.8931 vs. 0.5042 and 0.5057) and lower RMSE values (0.71 and 0.49 vs. 0.73 and 0.84) for the training and test sets respectively.Statistical significance using a two-tailed t-distribution for the training and test sets in Eq. 2 was further confirmed whereby ρ was calculated to be 8.7 × 10 −9 (n = 29) and 1.3 × 10 −3 (n = 7) respectively, i.e. far smaller than the standard accepted limit of 0.05.  1. QSPR models for skin permeability (logK p ) prediction using data extracted from HuskinDB based upon the ten most commonly encountered functional groups within the compounds analysed.Where a group has no contribution to the equation it has been excluded from the final equation listed.
As this is the first model of its kind (using functional groups to predict permeation) there are limitations in appropriate models for comparison.Figure 1 includes a baseline model produced using a combination of the permeation values extracted from HuskinDB and the average permeation value from all of the HuskinDB data used (−6.14),presented as a horizontal relationship compared with the far more linear relationships observed for the training and test datasets.To further corroborate the findings, RMSE values for both training and test datasets were calculated using mean baseline models.For the training set used in Eq. 1 (n = 144) and Eq. 2 (n = 29) the mean baseline models provided RMSE values of 1.07 and 1.31 respectively, far higher than those calculated from the equations themselves.This indicates that using the models will provide a better prediction of permeation compared with simply taking an average value from the dataset.
Furthermore, this model is also more suitable than those published by others, such as the well-known 'Potts and Guy' model 5 (R 2 = 0.67) or the United States Environmental Protection Agency DERMWIN ™ model 27 (R 2 = 0.66), both based on partition coefficient and molecular weight data, as opposed to functional group data.
The same set of compounds as those used in the training and test sets for Eq. 2 were then analysed using the 'Potts and Guy' model 5 and the DERMWIN ™ model (log Kp (cm/h) = −2.80+ 0.66 logKow -0.0056 MW) 27 .
For both models an R 2 of 0.504 was calculated for the training sets and 0.737 and 0.738 for the test sets, i.e. all four values were lower than obtained for Eq. 2. Both models also exhibited higher RMSE values of 1.23 and 1.17 for the training sets with 1.15 and 1.10 for the test sets respectively, i.e. higher than obtained for Eq. 2.
Even when compared alongside a far more complex QSAR model based on substructural molecular fragments that considers types of bonds (single/double/triple) 28 , our model performed well with a higher test set R 2 value (0.893 vs. 0.630).Furthermore, the aforementioned publication does not specify the exact values of the separate contributions.For example, our 'constant' contribution in Eq. 2 is defined as −4.916 (similar to their value of approximately −5) yet their exact values for each fragment contribution are not provided.Their lack of inclusion of specific values does not facilitate the same level of usefulness for readers to facilitate permeation calculation that our approach provides.Therefore, we propose that our model could be used to predict permeation for any compound that contains one or more functional groups within the compound and no other physicochemical information is required.From a practical perspective, it is envisaged that the model can be applied very simply by a researcher once they have identified the chemical composition of their compound under investigation.From this point they can then use the model to calculate the overall contribution for the groups and insert that into the equation to achieve a predicted permeation value, as summarised for a model compound (cytarabine) in Fig. 2.
Using the model in this way transforms the theoretical concept to a practical and useful tool for researchers to use when wishing to predict permeation, i.e., taking a dataset, transforming it into a model and then confirming its suitability for predictive purposes.It could be argued that there are limits on the range of compounds that can be predicted from such equations, for example only for those with similar molecular mass or lipophilicity ranges to those in the dataset.At this time, it is not possible to confirm how far beyond the included range of properties the model would be reliable thus reasonable caution should be taken when extending beyond such limits.Furthermore, the authors acknowledge the limited size of the dataset (which can introduce stochastic effects), and that larger datasets would provide access to more sophisticated models such as random forests.In summary, this approach dramatically simplifies mathematical prediction whilst also ensuring the obtained values are human-relevant and therefore offers an exciting way forward for simple, yet precise, permeability prediction for a wide variety of compounds.

Methods
All K p values (cm/s) analysed in the study were considered as logK p values from within the HuskinDB database 29 , expressed as logarithmic (logK p ) values as this is standard procedure.logK p values were analysed with the ten most commonly encountered functional groups in the dataset: amide, amine, aromatic, bromine, carboxylic acid, chlorine, ester, ether, hydroxyl and ketone.
Our goal was to fit a multiple regression model of the form 30 : to the data available, Fig. 2 An example of how the mathematical model can be applied for a given compound using Eq. 2, illustrated using a compound from within HuskinDB (cytarabine) to highlight the correlation between the experimental and predicted values.
where a 0 is the best estimate of logK p if no information about the functional groups is present or they have no effect (equivalent to the intercept of a simple regression model) and a j are the parameters relating to each property, the amount that is added or subtracted to estimate logK p due to the presence of each functional group of type j present in the molecule of interest.y is logK p for the molecule and x j the number of functional groups j present in the molecule.This is done by minimising the quantity q, which is the sum of the deviation between observed and predicted values of logK p squared: It can be shown that the values of a j that optimise this expression, when expressed in matrix form, are: where a is a column vector containing the n + 1 fitted parameters a 0 , a 1 , …. a n Y is a column vector containing the m observed values of logK p X is matrix with m rows and n + 1 columns.Each row containing the number of each of the n functional groups present, with the first column being filled with 1's as there is no data associated with the a 0 term (it is 1 a 0 , rather than x 0 a 0 ).
X T is the same matrix transposed so it has m columns and n + 1 rows.This is required, to allow the matrices to be multiplied.
Once the estimates for the parameters a 0 , a 1 … had been determined a method was required to test the goodness of fit, whether the parameter is statistically different to it being zero or whether incorporating a term associated with a specific functional group gives anything significant to the model.This is done, in two ways.Firstly, an ANOVA table was constructed which tests the hypothesis that all the fitted values are equal to zero by calculating the F statistic and its associated probability.Using the same matrix notation as above, it can be shown that the maximum likelihood estimate for the standard deviation in the fitted values of y is: and the fitted parameter is considered not significantly different to zero if: where c jj is the jth diagonal element of the square matrix (X T .X) −1 used in fitting the regression parameters and t m−n is the t statistic on m−n degrees of freedom at the required level of significance.This analytical approach seemed suitable as it has been previously applied to a wide variety of applications [31][32][33][34][35] .Having determined a method for fitting the regression parameters and then assessing their significance, the next task was to determine the most appropriate parameters to include in the model.For this we adopted the 'top down' approach whereby a regression model was fitted using all possible parameters and the least significant was then removed (or each parameter that is not significantly different to zero removed in turn) and a model with one fewer parameter fitted.This process was repeated until all the parameters included in the model were significantly different to zero at the required level.Two approaches were adopted for creation of an optimised model using different subsets of data.Firstly, 180 compounds from the dataset were included for analysis (after removal of extreme outliers and unusual functional groups).Secondly, as with our previous study 24 , four experimental variables: skin source (breast/ abdomen/ thigh), skin layer used (epidermis/ dermis/ epidermis + dermis/ stratum corneum), concentration of donor solution (neat/diluted) and donor solution temperature (20-25/26-30/31-35/36-40 °C) were considered.As before, these four variables included a total of 96 scenarios yet only 27 were analysed (i.e., where n > 1), with 71 compounds from the 253 in total excluded as they did not fulfil the requirement to have at least one specified experimental variable.It should be noted that some of the remaining compounds were sometimes considered in more than one scenario where multiple logK p values were provided under different experimental variables.
Multiple linear regression analysis (using Microsoft Excel (Data Analysis), Microsoft 365 ® ) with the ten functional groups created models with their associated coefficients of determination (R 2 ).Data was divided randomly into training (80%) and test (20%) sets using the training set to form an equation for each model which was then reviewed using the associated test set.The decision to use an 80:20 split was chosen to follow that used in our previous work, which itself was selected based on supporting literature 35 .For comparative analysis with existing models all calculated logP/logKow and MW values were extracted from (www.molinspiration.com 36) for consistency.